INTERSPEECH.2012 - Speech Processing

Total: 35

#1 A study of mutual information for GMM-based spectral conversion

Authors: Hsin-Te Hwang ; Yu Tsao ; Hsin-Min Wang ; Yih-Ru Wang ; Sin-Horng Chen

The Gaussian mixture model (GMM)-based method has dominated the field of voice conversion (VC) for the last decade. However, the converted spectra are excessively smoothed and thus produce a muffled converted sound. In this study, we improve speech quality by enhancing the dependency between the source feature vectors (natural sound) and the converted feature vectors (converted sound). It is believed that enhancing this dependency can make the converted sound closer to the natural sound. To this end, we propose an integrated maximum a posteriori and mutual information (MAPMI) criterion for parameter generation in spectral conversion. Experimental results from a formal listening test demonstrate that speech converted by the proposed MAPMI method has better quality than that produced by the conventional method.

#2 Bayesian mixture of probabilistic linear regressions for voice conversion

Authors: Na Li ; Yu Qiao

The objective of voice conversion is to transform the voice of one speaker to make it sound like another. The GMM-based statistical mapping technique has proven to be an efficient method for converting voices. We generalized this technique to the Mixture of Probabilistic Linear Regressions (MPLR) by using a general mixture model of the source vectors. In this paper, we improve MPLR by introducing a prior on the transformation parameters of the linear regressions, which leads to the Bayesian Mixture of Probabilistic Linear Regressions (BMPLR). BMPLR inherits the effectiveness and robustness of Bayesian inference. In particular, when the amount of training data is limited and the number of mixtures is large, BMPLR can largely relieve the overfitting problem. This paper presents two formulations of BMPLR, depending on how noise is modeled in the probabilistic regression function. In addition, we derive equations for MAP estimation of the transformation parameters. We examine the proposed method on voice conversion of Japanese utterances. The experimental results show that BMPLR achieves better performance than MPLR.

#3 Iterative MMSE estimation of vocal tract length normalization factors for voice transformation

Authors: Daniel Erro ; Eva Navas ; Inma Hernáez

We present a method that determines the optimal configuration of a bilinear vocal tract length normalization function to transform the frequency axis of one voice according to a specific target voice. Given a number of parallel utterances from the involved speakers, the single parameter of this function can be calculated through an iterative procedure that minimizes an objective error measure defined in the cepstral domain. This method is also applicable when multiple warping classes are considered, and it can be complemented with amplitude correction filters. The resulting physically motivated cepstral transformation yields highly satisfactory conversion accuracy and improved quality with respect to standard statistical systems.
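
As a rough illustration of the warping function involved, the sketch below applies the standard bilinear frequency warp and grid-searches its single parameter by minimizing a squared error between aligned source and target frames. The function names, the grid bounds, and the use of log-magnitude spectra (rather than the paper's iterative MMSE estimation in the cepstral domain) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Bilinear frequency warping: maps normalized frequency in [0, pi]
    onto [0, pi]; alpha in (-1, 1) controls the degree of warping."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def warp_log_spectrum(log_spec, alpha):
    """Resample a log-magnitude spectrum (uniform grid on [0, pi]) along
    the bilinearly warped frequency axis."""
    omega = np.linspace(0.0, np.pi, len(log_spec))
    return np.interp(bilinear_warp(omega, alpha), omega, log_spec)

def estimate_alpha(src_frames, tgt_frames, grid=np.linspace(-0.4, 0.4, 81)):
    """Pick the single warping parameter that minimizes the mean squared
    spectral error between warped source frames and aligned target frames."""
    errors = [np.mean([(warp_log_spectrum(s, a) - t) ** 2
                       for s, t in zip(src_frames, tgt_frames)]) for a in grid]
    return grid[int(np.argmin(errors))]
```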

#4 An HMM approach to residual estimation for high resolution voice conversion

Authors: Winston Percybrooks ; Elliot Moore

Voice conversion systems aim to process speech from a source speaker so that it is perceived as spoken by a target speaker. This paper presents a procedure to improve high resolution voice conversion by modifying the algorithm used for residual estimation. The proposed residual estimation algorithm exploits the temporal dependencies between residuals in consecutive speech frames using a hidden Markov model. A previous residual estimation technique based on Gaussian mixtures is used for comparison. Both algorithms are subjected to tests measuring perceived identity conversion and converted speech quality. The proposed algorithm was found to generate converted speech with significantly better quality without degrading identity conversion performance with respect to the baseline, working particularly well for female target speakers and cross-gender conversions.

#5 Implementation of computationally efficient real-time voice conversion

Authors: Tomoki Toda ; Takashi Muramatsu ; Hideki Banno

This paper presents an implementation of real-time processing of statistical voice conversion (VC) based on Gaussian mixture models (GMMs). To develop VC applications that enhance human-to-human speech communication, it is essential to implement real-time conversion processing. Moreover, it is useful to further reduce the computational complexity of the conversion processing so that VC applications can run on devices with limited resources. In this paper, we propose an implementation method for real-time VC based on low-delay conversion processing that considers dynamic features and a global variance. We also propose computationally efficient VC processing based on fast source feature extraction and diagonalization of full covariance matrices. Experimental results show that the proposed methods work reasonably well.

#6 Effects of speaker adaptive training on tensor-based arbitrary speaker conversion

Authors: Daisuke Saito ; Nobuaki Minematsu ; Keikichi Hirose

This paper introduces speaker adaptive training techniques into tensor-based arbitrary speaker conversion. In voice conversion studies, realizing conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. Although EVC can effectively construct the conversion model for arbitrary target speakers using only a few utterances, it does not effectively improve performance even when a large amount of adaptation data is used, because of an inherent problem of GMM supervectors. We previously proposed a tensor-based speaker space as a solution to this problem and realized more flexible control of speaker characteristics. In this paper, to further improve VC performance, speaker adaptive training and tensor-based speaker representation are integrated. The proposed method constructs a flexible and precise conversion model, and experimental results on one-to-many voice conversion demonstrate the effectiveness of the proposed approach.

#7 Low-SNR, speaker-dependent speech enhancement using GMMs and MFCCs

Authors: Laura Boucheron ; Phillip L. De Leon

In this paper, we propose a two-stage speech enhancement technique. In the training stage, a Gaussian mixture model (GMM) of the mel-frequency cepstral coefficients (MFCCs) of a user's clean speech is computed, wherein the component densities of the GMM serve to model the user's "acoustic classes." In the enhancement stage, MFCCs from a noisy speech signal are computed and the underlying clean acoustic class is identified via a maximum a posteriori (MAP) decision and a novel mapping matrix. The associated GMM parameters are then used to estimate the MFCCs of the clean speech from the MFCCs of the noisy speech. Finally, the estimated MFCCs are transformed back to a time-domain waveform. Our results show that we can improve PESQ in environments with SNR as low as -10 dB.
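
The following sketch illustrates the two-stage idea with scikit-learn: a GMM of clean MFCCs defines the acoustic classes, and each noisy frame is assigned its MAP class. The paper's mapping matrix and its clean-MFCC estimator are not reproduced here; scoring noisy frames directly against the clean model and substituting component means are simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Training stage: fit a GMM to the user's clean-speech MFCC frames so that
# each mixture component models one "acoustic class".
clean_mfcc = np.random.randn(5000, 13)   # placeholder for real clean MFCC frames
gmm = GaussianMixture(n_components=32, covariance_type='diag').fit(clean_mfcc)

# Enhancement stage: for each noisy MFCC frame, take the MAP acoustic class
# and use the corresponding clean-class statistics as a crude estimate.
noisy_mfcc = np.random.randn(300, 13)    # placeholder for real noisy MFCC frames
map_class = gmm.predict(noisy_mfcc)       # argmax posterior component per frame
enhanced_mfcc = gmm.means_[map_class]     # simplified clean-MFCC estimate per frame
```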

#8 Can modified casual speech reach the intelligibility of clear speech?

Authors: Maria Koutsogiannaki ; Michelle Pettinato ; Cassie Mayo ; Varvara Kandia ; Yannis Stylianou

Clear speech is a speaking style adopted by speakers in an attempt to maximize the clarity of their speech, and it has been shown to be more intelligible than casual speech. This work focuses on modifying casual speech to sound as intelligible as clear speech. To that purpose, a database of read speech sentences, recorded in both clear and casual speaking styles, is analyzed. Based on the analysis of the database, speaking rate is the most prominent characteristic that differs between the two speaking styles. To examine whether speaking rate plays a role in the intelligibility advantage of clear speech, clear speech signals are time-scaled to a higher speaking rate to match the duration of the casual signals. Subjective and objective measures on time-scaled clear speech and casual speech reveal that the low speaking rate is the main factor contributing to clear speech intelligibility. However, when casual signals are expanded in time, their intelligibility deteriorates. Since time-scale modifications of casual signals seem unsuitable for increasing intelligibility, spectral transformations of the casual speech signals are performed instead. Subjective and objective tests show a significant enhancement of the intelligibility of casual signals, reaching the intelligibility scores of clear signals.

#9 Speech enhancement using sparse convolutive non-negative matrix factorization with basis adaptation

Authors: Michael A. Carlin ; Nicolas Malyska ; Thomas F. Quatieri

We introduce a framework for speech enhancement based on convolutive non-negative matrix factorization that leverages available speech data to enhance arbitrary noisy utterances with no a priori knowledge of the speakers or noise types present. Previous approaches have shown the utility of a sparse reconstruction of the speech-only components of an observed noisy utterance. We demonstrate that an underlying speech representation which, in addition to applying sparsity, also adapts to the noisy acoustics improves overall enhancement quality. The proposed system performs comparably to a traditional Wiener filtering approach, and the results suggest that the proposed framework is most useful in moderate- to low-SNR scenarios.

#10 Inventory-based audio-visual speech enhancement

Authors: Dorothea Kolossa ; Robert Nickel ; Steffen Zeiler ; Rainer Martin

In this paper we propose to combine audio-visual speech recognition with inventory-based speech synthesis for speech enhancement. Unlike traditional filtering-based speech enhancement, inventory-based speech synthesis avoids the usual trade-off between noise reduction and consequential speech distortion. For this purpose, the processed speech signal is composed from a given speech inventory which contains snippets of speech from a targeted speaker. However, the combination of speech recognition and synthesis is susceptible to noise as recognition errors can lead to a suboptimal selection of speech segments. The search for fitting clean speech segments can be significantly improved when audio-visual information is utilized by means of a coupled HMM recognizer and an uncertainty decoding framework. First results using this novel system are reported in terms of several instrumental measures for three types of noise.

#11 Utilization of the Lombard effect in post-filtering for intelligibility enhancement of telephone speech

Authors: Emma Jokinen ; Paavo Alku ; Martti Vainio

Post-filtering methods are used in mobile communications to improve the quality and intelligibility of speech. This paper introduces a noise-adaptive post-filtering algorithm that models the spectral effects observed in natural Lombard speech. The proposed method and another post-filtering technique were compared to unprocessed speech and natural Lombard speech in subjective listening tests in terms of intelligibility and quality. The results indicate that the proposed method outperforms the reference method in difficult noise conditions.

#12 Speech enhancement by online non-negative spectrogram decomposition in nonstationary noise environments

Authors: Zhiyao Duan ; Gautham J. Mysore ; Paris Smaragdis

Classical single-channel speech enhancement algorithms have two convenient properties: they require pre-learning the noise model but not the speech model, and they work online. However, they often have difficulties in dealing with non-stationary noise sources. Source separation algorithms based on non-negative spectrogram decompositions are capable of dealing with non-stationary noise, but do not possess the aforementioned properties. In this paper we present a novel algorithm that combines the advantages of both classical algorithms and non-negative spectrogram decomposition algorithms. Experiments show that it significantly outperforms four categories of classical algorithms in non-stationary noise environments.

#13 Sibilant speech detection in noise

Authors: Sira Gonzalez ; Mike Brookes

We present an algorithm for identifying the location of sibilant phones in noisy speech. Our algorithm does not attempt to identify sibilant onsets and offsets directly but instead detects a sustained increase in power over the entire duration of a sibilant phone. The normalized estimate of the sibilant power in each of 14 frequency bands forms the input to two Gaussian mixture models that are trained on sibilant and non-sibilant frames respectively. The likelihood ratio of the two models is then used to classify each frame. We evaluate the performance of our algorithm on the TIMIT database and demonstrate that the classification accuracy is over 80% at 0 dB signal to noise ratio for additive white noise.
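
A minimal sketch of the frame classification step, assuming scikit-learn GMMs over the 14 band-power features; the number of mixture components and the zero decision threshold are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder band-power features: one 14-dimensional vector per frame.
sib_feats = np.random.rand(2000, 14)    # training frames labelled sibilant
non_feats = np.random.rand(8000, 14)    # training frames labelled non-sibilant
test_feats = np.random.rand(500, 14)    # frames to classify

# One GMM per class, trained on normalized band-power estimates.
gmm_sib = GaussianMixture(n_components=8, covariance_type='diag').fit(sib_feats)
gmm_non = GaussianMixture(n_components=8, covariance_type='diag').fit(non_feats)

# Frame-wise log-likelihood ratio; a threshold of 0 corresponds to equal priors.
llr = gmm_sib.score_samples(test_feats) - gmm_non.score_samples(test_feats)
is_sibilant = llr > 0.0
```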

#14 Voice activity detection using speech recognizer feedback

Authors: Kit Thambiratnam ; Weiwu Zhu ; Frank Seide

This paper demonstrates how feedback from a speech recognizer can be leveraged to improve Voice Activity Detection (VAD) for online speech recognition. First, reliably transcribed segments of audio are fed back by the recognizer as supervision for VAD model adaptation. This allows the much stronger LVCSR acoustic models to be harnessed without adding computation. Second, when to make a VAD decision is dictated by the recognizer, not the VAD module, allowing an implicit dynamic look-ahead for VAD. This improves robustness but can be gracefully reduced to meet latency requirements if necessary, without retraining or retuning the VAD module. Experiments on telephone conversations yielded a 6.7% abs. reduction in frame classification error rate when feedback was applied to HMM-based VAD, and a 4.2% abs. reduction over the best baseline system. Furthermore, a 3.0% abs. WER reduction was achieved over the best baseline in speech recognition experiments.

#15 Descriptive vocabulary development for degraded speech

Authors: Dushyant Sharma ; Gaston Hilkhuysen ; Patrick A. Naylor ; Nikolay D. Gaubitch ; Mark Huckvale ; Mike Brookes

This paper presents the development of a compact vocabulary for describing the audible characteristics of degraded speech. An experiment was conducted with 51 English-speaking subjects who were tasked with assigning one of a list of given text descriptors to 220 degradation conditions. Exploratory data analysis using hierarchical clustering resulted in a compact vocabulary of 10 classes, which was further validated by a bootstrap cluster analysis.

#16 Overlapped speech detection in meeting using cross-channel spectral subtraction and spectrum similarity

Authors: Ryo Yokoyama ; Yu Nasu ; Koichi Shinoda ; Koji Iwano

We propose an overlapped speech detection method for speech recognition and speaker diarization of meetings, where each speaker wears a lapel microphone. Two novel features are utilized as inputs for a GMM-based detector. One is speech power after cross-channel spectral subtraction which reduces the power from the other speakers. The other is an amplitude spectral cosine correlation coefficient which effectively extracts the correlation of spectral components in a rather quiet condition. We evaluated our method using a meeting speech corpus of four persons. The accuracy of our proposed method, 74.1%, was significantly better than that of the conventional method, 67.0%, which uses raw speech power and power spectral Pearson's correlation coefficient.
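
The sketch below computes simplified versions of the two features for a pair of lapel-microphone channels: power after cross-channel spectral subtraction and the amplitude-spectral cosine correlation. The over-subtraction factor and the reduction to two channels are assumptions for illustration.

```python
import numpy as np

def frame_features(spec_a, spec_b, over_sub=1.0):
    """Per-frame features from two lapel-microphone magnitude spectrograms
    (shape: frames x bins).  Returns (i) speech power after cross-channel
    spectral subtraction and (ii) the amplitude-spectral cosine correlation."""
    # Cross-channel spectral subtraction: remove the power leaking in
    # from the other channel, floored at zero.
    power = np.maximum(spec_a ** 2 - over_sub * spec_b ** 2, 0.0).sum(axis=1)
    # Cosine correlation between the two amplitude spectra of each frame.
    num = (spec_a * spec_b).sum(axis=1)
    den = np.linalg.norm(spec_a, axis=1) * np.linalg.norm(spec_b, axis=1) + 1e-12
    return power, num / den
```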

#17 Speech restoration based on deep learning autoencoder with layer-wised pretraining

Authors: Xugang Lu ; Shigeki Matsuda ; Chiori Hori ; Hideki Kashioka

Neural networks can be used to “remember” speech patterns by encoding the statistical regularities of speech in the network parameters. Clean speech can then be “recalled” when noisy speech is input to the network. Adding more hidden layers increases network capacity, but as the number of hidden layers grows (a deep network), the network is easily trapped in a local solution when a traditional training strategy is used. Therefore, the performance of a deep network is sometimes even worse than that of a shallow network. In this study, we explore a greedy layer-wise pretraining strategy to train a deep autoencoder (DAE) for speech restoration, and apply the restored speech to noise-robust speech recognition. The DAE is first pretrained layer by layer using a quasi-Newton optimization algorithm, where each layer is regarded as a shallow autoencoder and the output of the preceding layer serves as the input to the next layer. The pretrained layers are stacked and “unrolled” to form a DAE, and the pretrained parameters serve as the initial parameters of the DAE, which are then refined by further training. The trained DAE is used as a filter for speech restoration when noisy speech is given. Noise-robust speech recognition experiments were conducted to examine the performance of the trained deep network. Experimental results show that the DAE trained with the pretraining process significantly improves speech restoration from noisy input.
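
A compact sketch of greedy layer-wise pretraining followed by unrolling and fine-tuning, written in PyTorch. Layer sizes, sigmoid activations, epoch counts, and the use of Adam (in place of the quasi-Newton optimizer described above) are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

def pretrain_layer(x, in_dim, hid_dim, epochs=50):
    """Train one shallow autoencoder on x and return its encoder layer."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(torch.sigmoid(enc(x))), x)
        loss.backward()
        opt.step()
    return enc

# Greedy layer-wise pretraining: each layer is trained as a shallow
# autoencoder on the output (code) of the preceding layer.
x = torch.randn(1000, 120)              # placeholder spectral context frames
dims = [120, 500, 250]
encoders, h = [], x
for i in range(len(dims) - 1):
    enc = pretrain_layer(h, dims[i], dims[i + 1])
    encoders.append(enc)
    h = torch.sigmoid(enc(h)).detach()

# "Unroll" the stacked encoders into a deep autoencoder and fine-tune it
# to map noisy input frames to clean target frames.
decoders = [nn.Linear(d_out, d_in) for d_in, d_out in zip(dims[:-1], dims[1:])][::-1]
layers = []
for enc in encoders:
    layers += [enc, nn.Sigmoid()]
for dec in decoders[:-1]:
    layers += [dec, nn.Sigmoid()]
layers.append(decoders[-1])
dae = nn.Sequential(*layers)

noisy, clean = torch.randn(1000, 120), torch.randn(1000, 120)  # placeholders
opt = torch.optim.Adam(dae.parameters(), lr=1e-4)
for _ in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(dae(noisy), clean).backward()
    opt.step()
```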

#18 Detection and positioning of overlapped sounds in a room environment

Authors: Rupayan Chakraborty ; Climent Nadeu ; Taras Butko

The description of acoustic activity in a room environment has to face the problem of overlapped sounds, i.e. those which occur simultaneously. That problem can be tackled by carrying out some kind of source signal separation followed by detection and recognition of the identity of each of the overlapped sounds. An alternative approach relies on modeling all possible overlapping combinations of acoustic events. For a spatial scene description, each detected acoustic event must additionally be assigned to one of the estimated source positions. Both detection approaches are tested in our work for the case of two simultaneous sources, one of which is speech, and an array of three microphones. Blind source separation based on the deflation method and null-steering beamforming are used for signal separation. A position assignment system is also developed and tested in the same experimental scenario; it is based on the above-mentioned beamformer and makes its decision based on a likelihood ratio. Both signal-level fusion and likelihood fusion are used to combine the information from the two pairs of microphones. The reported experimental results illustrate the possibilities of the various implemented techniques.

#19 Foreground speech segmentation using zero frequency filtered signal

Authors: K. T. Deepak ; Biswajit Dev Sarma ; S. R. Mahadeva Prasanna

A method for the robust segmentation of foreground speech in the presence of background degradation using the zero frequency filtered signal (ZFFS) is proposed. The speech signal from the desired speaker collected over a mobile phone is termed foreground speech, and the acoustic background picked up by the same sensor, which includes both speech and non-speech sources, is termed background degradation. Zero frequency filtering (ZFF) of speech allows only information around the zero frequency to pass through. The features derived from the resulting ZFFS, namely the normalized first-order autocorrelation coefficient and the strength of excitation, are observed to differ between foreground speech and background degradation. A method for foreground speech segmentation is developed using these two features. Evaluation on utterances containing isolated words of foreground speech and background degradation collected in a real environment shows robust foreground speech segmentation.
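
A simplified sketch of zero frequency filtering and of the normalized first-order autocorrelation feature (the strength-of-excitation feature is omitted). The trend-removal window and frame sizes are assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.signal import lfilter

def zff_signal(x, fs, mean_pitch_ms=10.0):
    """Zero-frequency filtered signal: pass the differenced speech twice
    through a resonator at 0 Hz, then remove the local trend."""
    d = np.diff(x, prepend=x[0])
    y = lfilter([1.0], [1.0, -2.0, 1.0], d)    # first 0-Hz resonator
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)    # second 0-Hz resonator
    win = int(mean_pitch_ms * 1e-3 * fs) | 1    # odd-length trend-removal window
    trend = np.convolve(y, np.ones(win) / win, mode='same')
    return y - trend

def norm_autocorr_r1(z, fs, frame_ms=20.0, hop_ms=10.0):
    """Frame-wise normalized first-order autocorrelation of the ZFFS, one
    of the two features used to separate foreground speech from background."""
    flen, hop = int(frame_ms * 1e-3 * fs), int(hop_ms * 1e-3 * fs)
    r1 = []
    for start in range(0, len(z) - flen, hop):
        f = z[start:start + flen]
        r1.append(np.dot(f[:-1], f[1:]) / (np.dot(f, f) + 1e-12))
    return np.array(r1)
```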

#20 The effect of spectral estimator on common spectral measures for sibilant fricatives

Authors: Patrick Reidy ; Mary Beckman

Recently, speech researchers have begun to base spectral analyses of sibilant fricatives on modern spectral estimators that promise reduced error in the estimation of the spectrum of the acoustic waveform. In this paper we look at the effect that the choice of spectral estimator has on the estimation of spectral properties of English voiceless sibilant fricatives.

#21 Gaussian mixture gain priors for regularized nonnegative matrix factorization in single-channel source separation

Authors: Emad M. Grais ; Hakan Erdogan

We propose a new method to incorporate statistical priors on the solution of nonnegative matrix factorization (NMF) for single-channel source separation (SCSS) applications. A Gaussian mixture model (GMM) is used as a log-normalized gain prior model for the NMF solution; the normalization makes the prior models energy-independent. In NMF-based SCSS, NMF is used to decompose the spectra of the observed mixed signal as a weighted linear combination of a set of trained basis vectors. In this work, the NMF decomposition weights are constrained by statistical prior information about the weight combination patterns that the trained basis vectors can jointly receive for each source in the observed mixed signal. The NMF solutions for the weights are encouraged to increase the log-likelihood under the trained gain prior GMMs while reducing the NMF reconstruction error at the same time.
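
For context, the sketch below shows the unregularized baseline this work builds on: pre-trained speech and noise bases are held fixed, only the mixture weights are estimated with multiplicative updates, and a soft mask is built from the per-source reconstructions. The GMM log-gain prior term, which modifies these weight updates, is not included; basis sizes and the Euclidean cost are assumptions.

```python
import numpy as np

def nmf_weights(V, B, n_iter=100):
    """Multiplicative updates for the weights G in V ~ B @ G with the
    basis matrix B held fixed (Euclidean cost)."""
    G = np.random.rand(B.shape[1], V.shape[1])
    for _ in range(n_iter):
        G *= (B.T @ V) / (B.T @ B @ G + 1e-12)
    return G

# B_speech and B_noise are basis vectors trained beforehand on clean speech
# and noise spectra; V is the magnitude spectrogram of the mixture.
B_speech = np.random.rand(257, 40)   # placeholder trained speech bases
B_noise = np.random.rand(257, 20)    # placeholder trained noise bases
V = np.random.rand(257, 300)         # placeholder mixture spectrogram

B = np.hstack([B_speech, B_noise])
G = nmf_weights(V, B)

# Wiener-style mask built from the per-source reconstructions.
S_speech = B_speech @ G[:B_speech.shape[1]]
S_noise = B_noise @ G[B_speech.shape[1]:]
mask = S_speech / (S_speech + S_noise + 1e-12)
speech_estimate = mask * V
```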

#22 Speaker independent single channel source separation using sinusoidal features

Authors: Shivesh Ranjan ; Karen L. Payton ; Pejman Mowlaee

Model-based approaches to achieve Single Channel Source Separation (SCSS) have been reasonably successful at separating two sources. However, most of the currently used model-based approaches require pre-trained speaker specific models in order to perform the separation. Often, insufficient or no prior training data may be available to develop such speaker specific models, necessitating the use of a speaker independent approach to SCSS. This paper proposes a speaker independent approach to SCSS using sinusoidal features. The algorithm develops speaker models for novel speakers from the speech mixtures under test, using prior training data available from other speakers. An iterative scheme improves the models with respect to the novel speakers present in the test mixtures. Experimental results indicate improved separation performance as measured by the Perceptual Evaluation of Speech Quality (PESQ) scores of the separated sources.

#23 Boosting classification based speech separation using temporal dynamics

Authors: Yuxuan Wang ; DeLiang Wang

Significant advances in speech separation have been made by formulating it as a classification problem, where the desired output is the ideal binary mask (IBM). Previous work does not explicitly model the correlation between neighboring time-frequency units, and standard binary classifiers are used. Since one of the most important characteristics of the speech signal is its temporal dynamics, the IBM contains highly structured, rather than random, patterns. In this study, we incorporate temporal dynamics into classification by employing structured output learning. In particular, we use linear-chain structured perceptrons to account for the interactions of neighboring labels in time. However, the performance of structured perceptrons largely depends on the linear separability of the features. To address this problem, we employ pretrained deep neural networks to automatically learn effective feature functions for the structured perceptrons. The experiments show that the proposed system significantly outperforms previous IBM estimation systems.
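
A minimal sketch of a linear-chain structured perceptron for binary label sequences (Viterbi decoding plus perceptron updates), assuming each frequency channel is handled as an independent chain over time and that the feature frames X stand in for the DNN-learned feature functions mentioned above.

```python
import numpy as np

def viterbi(X, w_emit, w_trans):
    """Best binary label sequence for feature frames X (T x D) under
    emission weights w_emit (2 x D) and transition weights w_trans (2 x 2)."""
    T = len(X)
    emit = X @ w_emit.T                     # (T, 2) per-frame state scores
    score, back = emit[0].copy(), np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + w_trans     # rows: previous state, cols: current
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    y = np.zeros(T, dtype=int)
    y[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        y[t - 1] = back[t, y[t]]
    return y

def train(sequences, n_epochs=10):
    """Structured-perceptron training on (X, y) pairs, where y is the
    reference binary mask track of one frequency channel over time."""
    D = sequences[0][0].shape[1]
    w_emit, w_trans = np.zeros((2, D)), np.zeros((2, 2))
    for _ in range(n_epochs):
        for X, y in sequences:
            y_hat = viterbi(X, w_emit, w_trans)
            for t in range(len(y)):
                w_emit[y[t]] += X[t]
                w_emit[y_hat[t]] -= X[t]
                if t > 0:
                    w_trans[y[t - 1], y[t]] += 1.0
                    w_trans[y_hat[t - 1], y_hat[t]] -= 1.0
    return w_emit, w_trans
```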

#24 Acoustic features for classification based speech separation

Authors: Yuxuan Wang ; Kun Han ; DeLiang Wang

Speech separation can be effectively formulated as a binary classification problem. A classification based system produces a binary mask using acoustic features in each time-frequency unit. So far, only pitch and amplitude modulation spectrogram have been used as unit level features. In this paper, we study other acoustic features and show that they can significantly improve both voiced and unvoiced speech separation performance. To further explore complementarity in terms of discriminative power, we propose a group Lasso approach for feature combination. The final combined feature set yields promising results in both matched and unmatched test conditions.
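
A small sketch of group-lasso feature-group selection via proximal gradient descent on a squared loss; the paper applies the group-lasso idea to classification-based separation, so the loss, regularization weight, and synthetic data here are illustrative assumptions only.

```python
import numpy as np

def group_lasso(X, y, groups, lam=0.1, step=None, n_iter=500):
    """Proximal-gradient (ISTA) solver for least squares with a group-lasso
    penalty: groups whose weights shrink to zero are dropped from the set."""
    n, d = X.shape
    if step is None:
        step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        w -= step * X.T @ (X @ w - y) / n       # gradient step on the squared loss
        for g in groups:                        # group soft-thresholding (proximal step)
            norm = np.linalg.norm(w[g])
            w[g] *= max(0.0, 1.0 - step * lam / (norm + 1e-12))
    return w

# Example: three candidate acoustic feature groups of different relevance.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, :10] @ rng.standard_normal(10)         # target depends only on group 0
groups = [np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)]
w = group_lasso(X, y, groups, lam=0.5)
selected = [i for i, g in enumerate(groups) if np.linalg.norm(w[g]) > 1e-6]
```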

#25 Hidden Markov models as priors for regularized nonnegative matrix factorization in single-channel source separation

Authors: Emad M. Grais ; Hakan Erdogan

We propose a new method to incorporate rich statistical priors modeling temporal gain sequences into the solutions of nonnegative matrix factorization (NMF). The proposed method can be used for single-channel source separation (SCSS) applications. In NMF-based SCSS, NMF is used to decompose the spectra of the observed mixed signal as a weighted linear combination of a set of trained basis vectors. In this work, the NMF decomposition weights are constrained by statistical and temporal prior information about the weight combination patterns that the trained basis vectors can jointly receive for each source in the observed mixed signal. A hidden Markov model (HMM) is used as a log-normalized gain ("weights") prior model for the NMF solution; the normalization makes the prior models energy-independent, and the HMM serves as a rich model that characterizes the statistics of sequential data. The NMF solutions for the weights are encouraged to increase the log-likelihood under the trained gain prior HMMs while reducing the NMF reconstruction error at the same time.